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Abstract 

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software 
and graph applications. Recently, heterogeneous processors composed of different types of cores 
attracted much attention because of their flexible core configuration and high energy efficiency. 
In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing 
both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute 
segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly 
incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted 
partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD 
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our 
method obtains significant performance improvement over the best existing CSR-based SpMV 
algorithms. 

Keywords: 

Sparse matrices. Sparse matrix-vector multiplication. Compressed sparse row. Speculative 
execution. Segmented sum. Heterogeneous processor 


1. Introduction 

Sparse matrix-vector multiplication (SpMV) is perhaps the most widely-used non-trivial 
sparse basic linear algebra subprogram (BLAS) in computational science and modeling. The op¬ 
eration multiplies a sparse matrix A of size m x n by a dense vector x of size n and gives a dense 
vector y of size m. Despite its simplicity at the semantic level, an efficient SpMV implementation 
is generally hard, because A’s sparsity structure can be very irregular and unpredictable. 

Compared to CPUs, co-processors (e.g., GPUs and Xeon Phi) promise much higher peak 
floating-point performance and memory bandwidth. Thus a lot of research has focused on ac¬ 
celerating SpMV on co-processors. One straightforward way on utilizing co-processors is to de¬ 
velop all-new sparse matrix formats (e.g., HYB (11], Cocktail (2], JAD (2, ESB |@], BCCOO (2 
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and BRC |@1) for specific hardware architectures. The experimental results showed that these 
formats can provide performance improvement for various SpMV benchmarks. 

However, the completely new formats bring several new problems. The first one is backward- 
compatibility. When the input data are stored in basic formats such as compressed sparse row 
(CSR), a format conversion is required for using the new format based SpMV. In practice, fusing 
a completely new format into well-established toolkits (e.g., PETSc ||3l) for scientific software 
is not a trivial task |Q] because of the format conversion. Moreover, Kumbhar |@] pointed out 
that once an application (in particular a non-linear solver) needs repeated format conversion after 
a fixed small number of iterations, the new formats may degrade overall performance. Further¬ 
more, Langr and Tvrdik 10 demonstrated that isolated SpMV performance is insufficient to 
evaluate a new format. Thus more evaluation criteria, such as format conversion cost and mem¬ 
ory footprint, must be taken into consideration. Secondly, when the SpMV operation is used 
with other sparse building blocks (e.g., sparse matrix-matrix multiplication 11 in i that require 
basic storage formats, using the all-new formats is less feasible. 

To leverage the SpMV performance and the practicality, Liu and Vinter proposed the CSR5 
format lll2l] to extend the basic CSR format. The experimental results showed that the CSR5 
format delivers excellent SpMV performance, but merely needs very short format conversion 
time (a few SpMV operations) and very small extra memory footprint (around 2% of the CSR 
data). Because the CSR5 format shares data with the CSR format, the CSR-based sparse BLAS 
routines can efficiently work with the CSR5 format. However, when a solver only needs a few 
iterations, the CSR5 may not deliver speedups, compared to using the basic CSR data. 

Thereofore, improving performance of SpMV usin g th e most widely supported CSR for¬ 
mat has also gained plenty of attention |[ll S 12 . 11 , ni [3> [l3. [3- Most of the related 
work nm Tsi [3 [isl [3l has focused on improving row block method for the CSR-based 
SpMV. However, these newly proposed approaches are not highly efficient. The main reason 
is that co-processors are designed for load balanced high throughput computation, which is not 
naturally offered by the row pointer information of the CSR format. On the other hand, using seg¬ 
mented sum method for the CSR-based SpMV has been proposed by 0 and been implemented 
in libraries cuDPP |17, 3 211 and Modern GPU 13 for nVidia GPUs. Unlike the row block 
methods, the segmented sum algorithms evenly partition an input matrix A for nearly perfect 
load balancing, and thus may be suitable for a co-processor implementation. But unfortunately, 
this method cannot recognize empty rows and requires more costly global operations. These ex¬ 
tra overheads may offset performance gain of load balanced segmented sum and degrade overall 
SpMV efficiency. 

Recently, heterogeneous processors (which are also known as heterogeneous chip multipro¬ 
cessors) have been designed |22, 23] and implemented |24, 25, 3 22. Si- Compared to ho¬ 
mogeneous processors such as CPUs or GPUs, heterogeneous processors can deliver improved 
overall performance and power efficiency 10 . while sufficient heterogeneous parallelisms exist. 
The main characteristics of heterogeneous processors include unified shared memory and fast 
communication among different types of cores (e.g., CPU cores and GPU cores). Practically, 
heterogeneous system architecture (HSA) 10 . OpenCL |3J.] and CUBA |32| have supplied 
toolkits for programming heterogeneous processors. 

Our work described in this paper particularly focuses on accelerating CSR-based SpMV on 
CPU-GPU heterogeneous processors. The main idea of our SpMV algorithm is first specula¬ 
tively executing SpMV on a heterogeneous processor’s GPU cores targeting high throughput 
computation, and then locally re-arranging resulting vectors by the CPU cores of the same chip 
for low-latency memory access. To achieve load balanced first step computation and to utilize 
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both CPU and GPU cores, we improved the conventional segmented sum method by generating 
auxiliary information (e.g., segment descriptor) at runtime and recognizing empty rows on-the- 
fly. Compared to the row block methods for the CSR-based SpMV, our method delivers load 
balanced computation to achieve higher throughput. Compared with the classic segmented sum 
method for the CSR-based SpMV, our approach decreases the overhead of global synchroniza¬ 
tion and removes pre- and post-processing regarding empty rows. 

This paper makes the following contributions: 

• We propose a fast CSR-based SpMV algorithm that efficiently utilizes different types of 
cores in emerging CPU-GPU heterogeneous processors. 

• We develop an speculative segmented sum algorithm by generating auxiliary information 
on-the-fly and eliminating costly pre- and post-processing on empty rows. 

• We evaluate our CSR-based SpMV algorithm on a widely-adopted benchmark suite and 
achieve stable SpMV performance independent of the sparsity structure of input matrix. 

On a benchmark suite composed of 20 matrices with diverse sparsity structures, our approach 
greatly outperforms the row block methods for the CSR-based SpMV running on GPU cores of 
heterogeneous processors. On an Intel heterogeneous processor, the experimental results show 
that our method obtains up to 6.90x and on average 2.57x speedup over an OpenCL implementa¬ 
tion of the CSR-vector algorithm in CUSP running on its GPU cores. On an AMD heterogeneous 
processor, our approach delivers up to 16.07x (14.43x) and on average 5.61x (4.47x) speedup 
over the fastest single (double) precision CSR-based SpMV algorithm from PARALUTION and 
an OpenCL version of CUSP running on its GPU cores. On an nVidia heterogeneous processor, 
our approach delivers up to 5.91x (6.20x) and on average 2.69x (2.53x) speedup over the fastest 
single (double) precision CSR-based SpMV algorithm from cuSPARSE and CUSP running on 
its GPU cores. 

The paper is organized as follows. We first introduce background knowledge about the CSR 
format, the CSR-based SpMV algorithms and heterogeneous processors in Section 2. Then we 
describe our CSR-based SpMV algorithm in Section 3. Moreover, we give and analyze our 
experimental results in Section 4. We review the related approaches in Section 5. 

2. Background and Motivations 

2.1. CSR Format and CSR-based SpMV algorithms 

The CSR format of a sparse matrix consists of three separate arrays: (1) row pointer array 
of size m -H 1, where m is the number of rows of the matrix, (2) column index array of size nnz, 
where nnz is the number of nonzero entries of the matrix, and (3) value array of size nnz. Hence 
the overall space complexity of the CSR format is 0(m -H nnz). Below we show a sparse matrix 
A of size 6x6 and its CSR representation. 

a 0 b 0 0 c 

d e f 0 0 0 

._00g0h0 
"^“000000 
0 0 0 0 i 0 

0 0 j k I 0 
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row pointer - |^0,3,6,8,8,9,12j 
column index—^,2,5, 0,1,2, 2,4, 4, 2,3,4j 
value — l^a, b, c, d, e, f, g, h, i, j, k, /j. 

Assume we have an input dense vector 

^" = [1,2,3,4,5,6], 

we can obtain a dense vector y by multiplying the sparse matrix A by the vector x: 

= [a + 3^7 + 6c, d + 2e + 3 /, 3g + 5h, 0, 51, 3j + 4k + 5/j. 


The straightforward way to multiply A with x on a multicore processor is assigning a row 
block (i.e., multiple rows) to each core. Since the row pointer array records offset information of 
column index and value of nonzero entries, each core can easily position data in A and x. Then 
generating corresponding entries of y merely needs some simple multiply and add operations. 
For example, we assume that using a six-core processor for the above SpMV operation and each 
core is responsible for one row. We can notice that the cores calculating the first, the second and 
the sixth rows are busier than the other cores. Meanwhile, the core doing the fourth row is ac¬ 
tually idle while the other cores are working. Therefore, the row block method cannot naturally 
handle load balance on multicore processors. On co-processors composed of a large amount 
of lightweight single instruction, multiple data (SIMD) units, the problem can heavily d^rade 
performance of SpMV operation. Even though many strategies, such as vectorization ClSQ, 
data streaming lfT3l . memory coalescing |33[^tatic or dynamic binning QQ, Dynamic Par¬ 
allelism S and dynamic row distribution lll9ll . have been proposed for the row block method, 
it is still impossible to achieve nearly perfect load balancing in general sense, simply since row 
sizes are irregular and unpredictable. 

The other method of computing the CSR-based SpMV is utilizing a segmented sum algo¬ 
rithm. This method first generates a segment descriptor of size nnz- The descriptor marks the 
first nonzero entry of each non-empty row as 1 (or equivalently TRUE) and the other nonzero 
entries as 0 (or equivalently FALSE). Using the above 6-by-6 sparse matrix as an example, we 
have 


segment descriptor — [1,0,0, 1,0,0, 1,0, 1, 1,0,0]. 

Then an element-wise product array of size nnz is allocated and filled by calculating 

product[i] — x[column index[i]] X value[i],i e [0, nnz). 

The third step conducts a segmented sum operation on the product array by using segment in¬ 
formation stored in the segment descriptor. Finally, the sum in each segment is stored to a 
contiguous location in y. 

We can see that the segmented sum method can achieve nearly perfect load balance in the 
nonzero entry space. However, this method has two obvious drawbacks: (1) since the seg¬ 
ment descriptor is binary, this method is unable to recognize empty rows, thus a pre-processing 
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(squeezing out possible empty rows) is required for calculating a “clean” row pointer array, and 
a post-processing (adding zeros to proper locations) is needed for a correct resulting vector y, 
and (2) this method requires more expensive global synchronizations and global memory access 
than the row block method (which needs only one single kernel launch). Therefore, in practice, 
the segmented sum method is not necessarily faster than the row block methods. 


2.2. Heterogeneous Processors 

Compared to homogeneous chip multiprocessors such as CPUs and GPUs, the heterogeneous 
processors are able to combine different types of cores into one chip. Thus heterogeneous pro¬ 
cessors offer more flexibilities in architecture design space. Because of mature CPU and GPU 
architectures and applications, CPU-GPU integrated heterogeneous processor with multiple in¬ 
struction set architectures (ISAs) is the most widely adopted choice. Representatives of this 
model include AMD Accelerated Processing Units (APUs) mm, Intel multi-CPU and GPU 
system-on-a-chip (SoC) devices nVidia Echelon heterogeneous GPU architecture lEsIl . and 
many mobile processors (e.g., nVidia Tegra ll2^ and Qualcomm Snapdragon lE^l f. 

Figure[T]shows two block diagrams of heterogeneous processors used as experimental testbed 
in this paper. In general, a heterogeneous processor consists of four major parts: (1) a group of 
CPU cores with hardware-controlled caches, (2) a group of GPU cores with shared command 
processors, software-controlled scratchpad memory and hardware-controlled caches, (3) shared 
memory management unit, and (4) shared global dynamic random-access memory (DRAM). The 
last level cache of the two types of cores can be separate as shown in Figure [TJa) or shared as 
shown in Figureflfb). 

The CPU cores have higher single-thread performance due to out-of-order execution, branch 
prediction and large amounts of caches. The GPU cores execute massively parallel lightweight 
threads on SIMD units for higher aggregate throughput. The two types of compute units have 
completely different ISAs and separate cache sub-systems. 

In this paper, our experiments run on three different platforms (shown in Table |2]l, the plat¬ 
forms from AMD and nVidia are based on the design of Figure[Tla); the Intel platform uses the 
design of Figure [Hb). Note that in the current AMD APU architecture, although the two types 
of cores have separate last level caches, the GPU cores are able to snoop the last level cache on 
the CPU side. 

Compared to loosely-coupled CPU-GPU heterogeneous systems, the two types of cores in a 
heterogeneous processor share one single unified address space instead of using separate address 
spaces (i.e., system memory space and GPU device memory space). One obvious benefit is 
avoiding data transfer through connection interfaces (e.g ., PCIe link), which is one of the most 
well known bottlenecks of co-processor computing 13411 . Additionally, GPU cores can access 
more memory by paging memory to and from disk. Further, the consistent pageable shared 
virtual memory can be fully or partially coherent, meaning that much more efficient CPU-GPU 
interactions are possible due to eliminated heavyweight synchronization (i.e., flushing and GPU 
cache invalidation). Currently, programming on the unified address space and low-overhead 
kernel launch are supported by HSA OpenCF ImIi and CUDA 


3. New Sparse Matrix-Vector Multiplication Algorithm 

3.1. Data Decomposition 

We first evenly decompose nonzero entries of the input matrix to multiple small tiles for load 
balanced data parallelism. Here we define a tile as a 2D array of size W xT. The width T is the 
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(a) Heterogeneous processor with separate last level cache 



(b) Heterogeneous processor with CPU-GPU shared last level cache 


Figure 1: Block diagrams of two representative heterogeneous processors. 


size of a thread-bunch, which is the minimum SIMD execution unit in a given vector processor. 
It is also known as wavefront in AMD GPUs or warp in nVidia GPUs. The height W is the 
workload (i.e., the number of nonzero entries to be processed) of a thread. A tile is a basic work 
unit in matrix-based segmented sum method 351, which is used as a building block in our 
SpMV algorithm. Actually, the term “tile” is equivalent to the term “matrix” used in original 
description of the segmented scan algorithms 35]. Here we use “tile” to avoid confusion 
between a work unit of matrix shape and a sparse matrix in SpMV. 

Since a thread-bunch can be relatively too small (e.g., as low as 8 in current Intel GPUs) 
to amortize scheduling cost, we combine multiple thread-bunches into one thread-group (i.e., 
work-group in OpenCL terminology or thread block in CUBA terminology) for possibly higher 
throughput. We define B to denote the number of thread-bunches in one thread-group. Addition¬ 
ally, we let each thread-bunch compute S contiguous tiles. Thus higher on-chip resource reuse 
and faster global synchronization are expected. 

Therefore, we can calculate that each thread-group deals with BSWT nonzero entries. Thus 
the whole nonzero entry space of size nnz can be evenly assigned to [nnz/(BSWT)'] thread- 
groups. Figure |2] shows an example of the data decomposition. In this example, we set B = 2, 
S = 2, W - 4, and T - 2. Thus each thread-group is responsible for 32 nonzero entries. Then 
rnnz/32] thread-groups are dispatched. 
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Figure 2: Data decomposition on the nonzero entry space, nnz nonzero entries are assigned to multiple thread-groups. 
In this case, each thread-group consists of 2 thread-bunches (i.e., B = 2). The number of threads in each thread-bunch is 
equal to 2 (i.e., T = 2). The workload per thread is 4 (i.e., W = 4). The number of iterative steps in each thread-bunch is 
2 (i.e., S = 2). 

3.2. Algorithm Description 

Our CSR-based SpMV is based on fundamental segmented sum algorithm, which guarantees 
load balanced computation in the nonzero entry space. While utilizing segmented sum as a 
building block in our SpMV algorithm, we have three main performance considerations: (1) 
the segment descriptor needs to be generated in on-chip memory at runtime to reduce overhead 
of global memory access, (2) empty rows must be recognized and processed without calling 
specific pre- and post-processing functions, and (3) taking advantages of both types of cores in a 
heterogeneous processor. Hence we improve the basic segmented sum method to meet the above 
performance requirements. 

The algorithm framework includes two main stages: (1) speculative execution stage, and (2) 
checking prediction stage. The first stage speculatively executes SpMV operation and generates 
a possibly incorrect resulting vector y. Here the term “incorrect” means that the layout of entries 
in y can be incorrect, but the entries are guaranteed to be numerically identified. Then in the 
second stage we check whether or not the speculative execution is successful. If the prediction is 
wrong, a data re-arrangement will be launched for getting a completely correct y. 

We first give an example of our algorithm and use it in the following algorithm description. 
Figure [3] plots this example. The input sparse matrix includes 12 rows (2 of them are empty) 
and 48 nonzero entries. We set B to 1, 5 to 2, T to 4 and W to 6. This setting means that 
one thread-group is composed of one thread-bunch of size 4; each thread-bunch runs 2 iteration 
steps. Before GPU kernel launch, three containers, synchronizer, dirty-counter and speculator, 
are pre-allocated in DRAM for global synchronization and speculative execution. Algorithm [1] 
in Appendix A lists pseudo code of the first stage. 

The speculative execution stage includes the following steps: (1) positioning a range of row 
indices for nonzero entries in a given tile, (2) calculating segment descriptor based on the range, 
(3) conducting segmented sum on the tile, (4) saving partial sums to the computed index range in 
vector y. This stage also has some input-triggered operations such as labeling a tile with empty 
rows. 

First, each thread-bunch executes binary search of 5 -H 1 tile boundaries on the CSR row 
pointer array. Then we obtain corresponding row indices and store them in a scratchpad array 
tile ojfset of size 5 H-1. The results of the binary search are starting and ending row indices of the 
nonzero entries in each tile. Thus each tile knows the locations to store generated partial sums. 
Lines 3-7 of Algorithm[T]give a code expression of this step. In our example shown in Figure |3] 
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■ Initial state 
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• After speculative pass 1 

y =Q03c.^^qqqgXDGXDO 

• After speculative pass 2 

y =(D®3c0@2f@^^0(T)0 

' After synchronization 






o + an integer: pointer in the nonzero entry space 

□ + an integer: row index 

□ + an integer: counter 

O 3 ietter: eiement-wise product value (not head of a row) 

® + a letter: element-wise product value (head of a row) 

(3 + a letter or its integral multiple: output value (intermediate form) 
^ + a letter or its integral multiple: output value (final form) 

©- add to •- data movement 


Figure 3: An example of our CSR-based SpMV algorithm. The input sparse matrix contains 48 nonzero entries in 12 
rows (10 non-empty rows and 2 empty rows). One thread-hunch composed of 4 threads is launched in this 2-iteration 
process. The arrays synchronizer and speculator store tuples (shown with angular brackets). 


the 2 tiles of size 24 have 3 boundaries {0, 24, 48}. The results of binary search of {0, 24, 48) on 
the CSR row pointer array are {0,4, 12). Note that the binary search needs to return the rightmost 
match, if multiple slots have the same value. 

Then each thread-bunch executes an iteration of S steps. Lines 8-59 of Algorithm [T] give 
code expression of this step. Each iteration deals with one tile. By calculating offset between the 
left boundary of a tile and the covered row indices, a local segment descriptor is generated (lines 
14-21 in Algorithm[TJ. For example, the left boundary of the second tile is 24 and its row index 
range is 4-12. We need to compute offset between 24 and the row pointer {19, 27, 29, 29, 34, 37, 
37, 44, 48). Then we obtain a group of offsets {-5, 3, 5, 5, 10, 13, 13, 20, 24). After removing 
duplicate values and overflowed values on the left and the right sides, the effective part (3, 5, 10, 
13, 20) in fact implies local segment descriptor for the current tile. We can easily convert it to a 
binary expression (0, 0, 0, 1, 0, 1, 0, ... , 0, 0, 1, 0, 0, 0} through a scatter operation in on-chip 
scratchpad memory. Moreover, since each tile is an independent work unit, the first bit of its 
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segment descriptor should be TRUE. Thus the hnal expression becomes {1, 0, 0, 1, 0, 1, 0, ... , 0, 
0, 1,0, 0, 0}. In Figured the hlled and empty circles are heads (i.e.. Is or TRUEs) and body (i.e.. 
Os or FALSEs) of segments, respectively. 

While generating the segment descriptor, each thread detects whether or not its right neighbor 
wants to write to the same slot. If yes (like the duplicate offset information {..., 5, 5, ...} and {..., 
13, 13, ...) in the above example), we can make sure that this tile contains at least one empty 
row, since an empty row is expressed as two contiguous indices of the same value in the CSR 
row pointer array. Then we mark this tile as “dirty” (line 19 in Algorithm [TJ. Further, the 
dirty counter array stored in DRAM is incremented by atomic operation, and this tile’s offset is 
recorded in the speculator array (lines 53-58 in Algorithm[T]). In our example, dirty counter is 1 
and speculator array has a pair of offsets {(4, 12)} ((shown with angular brackets in Figure[3]|. 

Then we calculate and save element-wise products in scratchpad memory, based on its nonzero 
entries’ column indices, values and corresponding values in the vector x. Lines 22-26 of Algo- 
rithm[T]show code expression of this step. When hnished, we transmit the sum of the last segment 
to an intermediate space for the next iteration (lines 27-31 in Algorithm [TJ. In our example, the 
hrst tile’s last value 5e is transmitted to the next tile. Then we execute the matrix-based seg¬ 
mented sum (lines 32-33) on the tile. Because the segmented sum algorithm used here is very 
similar to the method described in ifl^ . we refer the reader to and several pervious GPU 
segmented sum algorithms 35] for details. But note that compared to ||T^ , our method 
makes one difference: we store partial sums in a compact pattern (i.e., values are arranged in 
order from the hrst location in the thread work space), but not save them to locations of corre¬ 
sponding segment heads. For this reason, we need to record the starting position and the number 
of partial sums. Then we can use an ordinary exclusive scan operation (lines 34-35) for obtain¬ 
ing contiguous indices of the partials sums in y. In Figure [3] we can see that the partial sums 
(expressed as hlled hexagons) are aggregated in the compact fashion. Note that empty hexagons 
are intermediate partial sums, which are already added to the correct position of segment heads. 

Finally, we store the partial sums to known locations in the resulting vector. Lines 36-52 
of Algorithm [T] show code expression. As an exception, the sum result of the hrst segment 
in a thread-bunch is stored to the synchronizer array (lines 40-43), since the hrst row of each 
thread-bunch may cross multiple thread-bunch. This is a well known issue while conducting 
basic primitives, such as reduction and prehx-sum scan, using more than one thread-group that 
cannot communicate with each other. In fact, atomic add operation can be utilized to avoid the 
global synchronization. But we choose not to use relatively slow global atomic operations and 
let a CPU core to later on hnish the global synchronization. Lines 62-68 of Algorithm [T] show 
the corresponding code expression. Since the problem size (i.e., [nnz/(S WT)]) can be too small 
to saturate a GPU core, a CPU core is in fact faster for accessing short arrays linearly stored in 
DRAM. Taking the hrst tile in Figure|3]as an example, its hrst partial sum is 3a, which is stored 
with its global index 0 to the synchronizer. After that, the value 3a is added to position 0 of y. 

When the above steps are complete, the resulting vector is numerically identihed, except that 
some values generated by dirty tiles are not in their correct locations. In Figure |2 we can see 
that after synchronization, vector y is already numerically identihed to its hnal form, but entries 
5g, 3h, li and 4 j generated by the second tile are located in wrong slots. 

The checking prediction stage hrst checks value of the dirty counter array. If it is zero, the 
previous prediction is correct and the result of the hrst stage is the hnal result; if it is not zero, the 
predicted entries generated by dirty tiles are scattered to their correct positions in the resulting 
vector. In this procedure, the CSR row pointer array is required to be read for getting correct 
row distribution information. Again, we use a CPU core for the irregular linear memory access, 
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which is more suitable for cache sub-systems in CPUs. In our example, entries 5g, 3h, li and 4 j 
are moved to their correct positions. Then the SpMV operation is done. 

3.3. Complexity Analysis 

Our CSR-based SpMV algorithm pre-allocates three auxiliary arrays, synchronizer, dirty 
counter and speculator, in DRAM. The space complexity of synchronizer is [nnz/iSWT)], 
equivalent to the number of thread-bunches. The size of dirty counter is constant 1. The spec¬ 
ulator array needs a size of [nnz/iWT)], equivalent to the number of tiles. Since W and T are 
typically set to relatively large values, the auxiliary arrays merely slightly increase overall space 
requirement. 

For each thread-bunch, we executes 5 H- 1 binary searches in the row pointer array of size 
m+l. Thus 0{\nnzl(SWTy\y.{S -i-l)xlog 2 (m-i-l)) = 0(nnzlog2(m) IWT) is work complexity of 
this part. On the whole, generating segment descriptor needs 0(m) time. Collecting element-wise 
products needs Oinnz) time. For each tile, segmented sum needs 0{WT -i-log2(7’)) time. Thus all 
segmented sum operations need (9(|'nnz/(lT7’)l(lT7’ H-log2(7’))) = 0{nnz+nnz\og2{T) IWT)\m\t. 
Saving entries to y needs Oim) time. Synchronization takes 0{\nnzl{SWTy\) - 0{nnzlSWT) 
time. Possible re-arrangement needs 0{m) time in the worst case. Thus overall work complexity 
of our CSR-based SpMV algorithm is 0(m -H nnz + nnz(iog 2 (ni) -H log 2 {T)) I WT). 

3.4. Implementation Details 

Based on the above analysis, we can see that when the input matrix is fixed, the cost of our 
SpMV algorithm only depends on two parameters; T and W. In our algorithm implementation, 
T is set to SIMD length of the used processor. Choosing W needs to consider the capacity of 
on-chip scratchpad memory. The other two parameters B and S are empirically chosen. Table [T] 
shows the selected parameters. Note that double precision is not currently supported in Intel 
OpenCL implementation for its GPUs. 


Table 1: The selected parameters 


Processor 

Intel 


AMD 


nVidia 

Precision 

32-bit 

32-bit 

64-bit 

32-bit 

64-bit 

single 

single 

double 

single 

double 

T 

8 

64 

64 

32 

32 

W 

16 

16 

8 

8 

4 

B 

4 

2 

2 

5 

5 

S 

6 

2 

5 

7 

7 


We implement the first stage of our algorithm in OpenCL for the Intel and AMD platforms 
(and CUDA for the nVidia platform) for GPU execution and the second stage in standard C 
language running on the CPU part. Since our algorithm needs CPU and GPU share some arrays, 
we allocate all arrays in Shared Virtual Memory supported by OpenCL for the best performance. 
On the nVidia platform, we use Unified Memory in CUDA SDK. 

4. Experimental Results 

4.1. Experimental Environments 

We use three heterogeneous processors, Intel Core i3-5010U, AMD A10-7850K APU and 
nVidia Tegra Kl, for evaluating SpMV algorithms. Table |2] shows specifications of the three 
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processors. All of them are composed of multiple CPU cores and GPU cores. The two types of 
cores in the Intel heterogeneous processor share a 3 MB last level cache. In contrast, GPU cores 
in the AMD heterogeneous processor can snoop the L2 cache of size 4 MB on the CPU side. 
Unlike those, the cache systems of the CPU part and the GPU part in the nVidia Tegra processor 
are completely separate. Note that currently the Intel GPU can run OpenCL program only on 
Microsoft Windows operating system. Also note that we use kB, MB and GB to denote 2'**, 2^° 
and 2^° bytes, respectively; and use GFlop to denote 10® flops. 


Table 2: The test environments used in our experiments 


Processor 

Intel Core i3-5010U 

AMD A10-78-50K APU 

nVidia Tegra K1 


Core type 

x86 CPU 

GPU 

x86 CPU 

GPU 

ARM CPU 

GPU 

Codename 

Broadwell 

HD 5500 

Steamroller 

GCN 

Cortex A15 

Kepler 

Cores @ clock (GHz) 

2 @ 2.1 

3 (® 0.9 

4@ 3.7 

8 @ 0.72 

4 @ 2.3 

1 @ 0.85 

SP flops/cycle 

2x32 

3x128 

4x8 

8x128 

4x8 

1x384 

SP peak (GFlop/s) 

134.4 

345.6 

118.4 

737.3 

73.6 

327.2 

DP flops/cycle 

2x16 

3x32 

4x4 

8x8 

2x2 

1x16 

DP peak (GFlop/s) 

67.2 

86.4 

59.2 

46.1 

18.4 

13.6 

LI data cache 

4x32 kB 

3x4 kB 

4x16 kB 

8x16kB 

4x32 kB 

lxl6kB 

L2 cache 

4x256 kB 

3x24 kB 

2x2 MB 

Unreleased 

2 MB 

128 kB 

L3 cache 

N/A 

384 kB 

N/A 

N/A 

N/A 

N/A 

Scratchpad 

N/A 

3x64 kB 

N/A 

8x64 kB 

N/A 

1 x48 kB 

Shared last level cache 

3 MB 



N/A 

N/A 


DRAM 

Dual-channel DDR3 

-1600 

Dutd-channel DDR3-1600 

Single-channel DDR3L-1866 

DRAM capacity 

8 GB 



8 GB 

2 GB 


DRAM bandwidth 

25.6 GB/s 



25.6 GB/s 

14.9 GB/s 


Operating system 

Microsoft Windows 64-bit 

Ubuntu Linux 14.04 64-bil 

Ubuntu Linux 14.04 32-bit 

GPU driver version 

15.36 



14.501 

rl9.2 


Compiler 

icc 15.0.2 



gcc 4.8.2 

gcc 4.8.2. nvcc 6.0.1 

Toolkit version 

OpenCL 2.0 



OpenCL 2.0 

CUDA 6.0 



4.2. Benchmark Suite 


To evaluate our method, we choose 20 unstructured matrices from the University of Florida 
Sparse Matrix Collection 11361] . Table |3]lists main information of the evaluated sparse matrices. 
The hrst 14 matrices of the benchmark suite have been widely used in previous work QSIl 
S 12, ISl- The last 6 matrices are chosen as representatives of irregular matrices extracted from 
graph applications, such as circuit simulation and optimization problems. 

The hrst 10 matrices are relatively regular, due to short distance between the average value 
and the maximum value of nnz/row. The other matrices are relatively irregular. In this context, 
‘regular’ is used for a sparse matrix including rows of roughly the same size. In contrast, an ‘ir¬ 
regular matrix’ can have some very long rows and many very short rows. For example, matrices 
generated from power-law graphs can have a few rows with Oin) nonzero entries and many rows 
with (9(1) nonzero entries. 


4.3. Experimental Setup 

To analyze efficiency of the proposed SpMV algorithm, we also benchmark parallel CSR- 
based SpMV using some other libraries or methods on CPUs and GPUs. 

On CPUs, we execute three CSR-based SpMV approaches: (1) OpenMP-accelerated basic 
row block method, (2) pOSKI library iIt^ using OSKI ll^ as a building block, and (3) Intel 
MKL vll.2 Update 2 in Intel Parallel Studio XE 2015 Update 2. The three approaches are 
running on all CPU cores of the used heterogeneous processors. For the Intel CPU, we report 
results from MKL, since it always delivers the best performance and the pOSKI is not supported 
by the used Microsoft Windows operating system. For the AMD CPU, we report the best results 
of the three libraries, since none of the three libraries outperforms all the others. For the ARM 
CPU included in the nVidia Tegra K1 platform, we only report results from OpenMP, since the 
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Table 3: Overview of evaluated sparse matrices 


Name 


Plot 


Dimensions 

iinz 

tinz/row (min, avg, max) 
Name 


Plot 


Dense Protein FEM/Spheres FEM/Canlilever Wind Tunnel 



2Kx2K 36KX36K 83Kx83K 62Kx62K 218Kx218K 

4.0M 4.3M 6.0M 4.0M 1L6M 

2K,2K,2K 18,119,204 1,72,81 1,64,78 2,53,180 


FEM/Harbor QCD FEM/Ship Economics Epidemiology 



Dimensions 

niiz 

nitz/row (min, avg, max) 
Name 


47 K X 47 K 
2.4 M 
4, 50, 145 
FEM/Accelerator 


49 K X 49 K 
1.9 M 
39, 39, 39 
Circuit 


141 Kx 141 K 
7.8M 
24,55, 102 
Webbase 


207 K X 207 K 
1.3 M 
1,6,44 
LP 


526 KX 526 K 
2.1 M 
2,3,4 
ASIC_680k 




current pOSKI and Intel MKL implementations do not support the ARM architecture. Moreover, 
single-threaded naive implementation on CPU is included in our benchmark as well. 

On GPUs, we benchmark variants of the CSR-scalar and the CSR-vector algorithms pro¬ 
posed i n jin . The OpenCL version of the CSR-scalar method is extracted from PARALLUTION 
vl .0.0 13911 and evaluated on the AMD platform. The OpenCL implementation of the CSR-vector 
method is extracted from semantically equivalent CUDA code in the CUSP library vO.4.0 and ex¬ 
ecuted on both the Intel and the AMD platforms. On the nVidia platform, we run the CSR-based 
SpMV from vendor-supplied cuSPARSE v6.0 and CUSP vO.4.0 libraries. 

For all tests, we run SpMV 200 times and record averages. The implicit data transfer (i.e., 
matrices and vectors data copy from their sources to OpenCL Shared Virtual Memory or CUDA 
Unihed Memory) is not included in our evaluation, since SpMV operation is normally one build¬ 
ing block of more complex applications. All participating methods conduct general SpMV, mean¬ 
ing that symmetry is not considered although some input matrices are symmetric. The throughput 
(flops per second) is calculated by 


2 X nnz 
runtime 
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Figure 4: Throughput (GFlop/s) of the single precision CSR-based SpMV algorithms running on the three platforms. 
“CPU (Best, multi-threaded)” shows the best results of OpenMP parallelization, pOSKI and Intel MKL on the AMD 
device. “CSR-scalar” and “CSR-vector” are variants of the row block algorithm on GPUs. “bhSPARSE” shows our 
CSR-based SpMV approach described in this paper. 


The bandwidth (bytes per second) is calculated by 

(m + 1 + nnz) X sizeof{idxJype) + (nnz + nnz + m) * sizeofivalJype) 

runtime 
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Figure 5: Throughput (GFlop/s) of the double precision CSR-based SpMV algorithms running on the AMD and the 
nVidia platforms. 


4.4. Performance Analysis 

Figures|4]and|5]show throughput of single precision and double precision SpMV of the tested 
CSR-based approaches, respectively. 

In Figure m we can see that on the Intel heterogeneous processor, our approach obtains up 
to 6.90x and on average 2.57x speedup over the CSR-vector method running on the used GPU. 
Although the speedup mainly comes from irregular matrices, our method generally does not 
obviously lose performance on regular matrices. Further, compared to CPU cores running MKL, 
both GPU SpMV algorithms are slower. For our algorithm, the main reason is that the integrated 
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GPU implements scratchpad memory in its L3 cache, which has one order of magnitude higher 
latency compared to fast scratchpad in nVidia or AMD GPUs. Our algorithm in fact heavily uses 
scratchpad memory for storing and reusing segment descriptor, element-wise products and other 
shared data by threads. Thus even though the GPU part of the Intel heterogeneous processor 
has higher single precision theoretical peak performance than its CPU part, the delivered SpMV 
throughput is lower than expected. For the CSR-vector method, the low performance has another 
reason; small thread-bunch of size 8 dramatically increases loop overhead iliflll . which is one of 
the well known bottlenecks 114 ill of GPU programming. 

In Figures|4]and|5j we can see that on the AMD heterogeneous processor, our method delivers 
up to 71.90x (94.05x) and on average 22.17x (22.88x) speedup over the single (double) precision 
CSR-scalar method running on the used GPU. Compared to the GPU CSR-vector method, our 
algorithm achieves up to 16.07x (14.43x) and on average 5.61x (4.47x) speedup. The CSR- 
scalar and the CSR-vector methods give very low throughput while running the last 6 irregular 
matrices, because of the problem of load imbalance. Further, we find that the Intel heterogeneous 
processor’s GPU is actually faster than the AMD GPU while running the last 6 matrices. The 
reason is that the shorter thread-bunch (8 in Intel GPU vs. 64 in AMD GPU) brings a positive 
influence for saving SIMD idle cost by executing a much shorter vector width for dramatically 
imbalanced row distribution. On the other hand, for several very regular matrices with short 
rows, e.g.. Epidemiology, the CSR-scalar method offers the best performance because of almost 
perfect load balance and execution of short rows without loop cost. For most regular matrices, 
our method delivers comparable performance over the best CPU algorithm. 

In Figures |4] and |5] we can see that on the nVidia platform, our method delivers up to 
5.91x (6.20x) and on average 2.69x (2.53x) speedup over the single (double) precision SpMV 
in the CUSP library running on the used GPU. Compared to cuSPARSE, our method has higher 
speedups. Since the both libraries use CSR-vector algorithm, those speedups are within expec¬ 
tations. Consider the Tegra K1 platform only contains one single GPU core, the problem of 
load imbalance on this device is not as heavy as on the above AMD platform. As a result, the 
speedups are not as high as those from the AMD processor. Here our method delivers on aver¬ 
age 1.4lx (1.42x) speedup over OpenMP-accelerated SpMV on the quad-core ARM CPU, while 
using single (double) precision benchmark. 

Figurel^shows bandwidth utilization of our algorithm proposed in this paper. We can see that 
the regular matrices can use bandwidth more efficiently compared to the irregular ones. Consid¬ 
ering the throughput speedups listed above, our method can obtain higher bandwidth utilization 
than the other CSR-based SpMV algorithms running on GPUs. 


4.5. Parameter Selection 

We further conduct experiments to exploit how selected parameters influence overall perfor¬ 
mance. 

Figure |7] shows dependency of the overall performance (harmonic means of the 20 bench¬ 
marks) on the parameters, while we fix all the parameters except for parameter W (i.e., workload 
per thread). We can see that in general the overall performance goes up as parameter W increases. 
This trend matches the algorithm complexity analysis described in Section 3.3. However, when 
W is larger than a certain value, the overall performance degrades. The reason is that device oc¬ 
cupancy may decrease while more on-chip scratchpad memory is allocated for WT work space 
of each thread-bunch. 

FigureOshows the trend of the overall performance while we change parameter S (i.e., the 
number of iterations of each thread-bunch) and fix all the other parameters. We can see that if we 
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(c) Single precision and double precision SpMV on nVidia Tegra K1 

Figure 6: Bandwidth utilization (GB/s) of our CSR-based SpMV algorithm running on the three platforms. Theoretical 
bandwidth from the hardware specifications are marked up using black lines. 





(d) AMD, DP 


(e) nVidia, DP 


Figure 7: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms 
while parameter W changes and all the others fixed to the best observed values (see Table[D. 
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Figure 8: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms 
while parameter S changes and all the others fixed to the best observed values(see Table[T). 


assign more work to each thread-bunch, a better performance can be expected. The performance 
improvement mainly comes from higher on-chip resource reuse. 


5. Comparison to Related Methods 

In recent years, some new formats have been designed for SpMV operation on various pro 
cessor architectures. Because of less off-chip memory access and better on-chip memory local- 


pOSKI 1331, CSB IMIM], 
4711. attracted the most attention. 


ization, block-based formats or libraries, such as OSKI 11381 14: 

BELLPACK US], BCCOO/BCCOOh- 11, BRC 10 and RSB 
However, block-based formats heavily rely on sparsity structure, meaning that the input matrix 
is required to have a block structure to meet potential block layout. Therefore, block-based for¬ 
mats are mainly suitable for some matrices generated from scientific computation problems, but 
may not fit irregular matrices generated from graph applications. Our method proposed in this 
paper is insensitive to the sparsity structure of input matrix, thus a generally better performance 
is achieved. 

A lot of research has focused on improving row block method CSR-based SpMV. Williams 
et al. ifn proposed multiple optimization techniques for SpMV on multi-core CPUs and Cell 
B.E. processor. Nishtala et al. ll48l] designed a high-level data partitioning method for SpMV 
to achieve better cache locality on multicore CPUs. Pichel et al. ll49n evaluated how reordering 
techniques influence performance of SpMV on GPUs. Baskaran and Bordawekar ll^ irnproved 
off-chip and on-chip memory access patterns of SpMV on GPUs. Reguly and Giles 11511] im¬ 
proved thread cooperation for better GPU cache utilization. Ashari et al. 11 1511 utilized static 
reordering and the Dynamic Parallelism scheme offered by nVidia GPUs for fast SpMV opera¬ 
tion. Greathouse et al. 10 grouped contiguous rows for better runtime load balancing on GPUs. 
LightSpMV ?? proposed to dynamically distribute matrix rows over warps in order for more 
balanced CSR-based SpMV without the requirement of generating auxiliary data structures, and 
implemented this approach using atomic operations and warp shuffle functions as the fundamen¬ 
tal building blocks. However, again, the row block methods cannot achieve good performance 
for input matrix with dramatically imbalanced row distribution. In contrast, our method is inde¬ 
pendent with the sparsity structure of input matrix. 

Using segmented sum as a building block is potentially a better generic method for the CSR- 
based SpMV. An early segmented sum method GPU SpMV was introduced by Sengupta et 
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al. lEoll and Garland 121 ] and implemented in the cuDPP library But the cost of segmented 
sum and global memory access degrade overall SpMV performance. Zhang ifs^l improved back¬ 
ward segmented scan for a better cache efficiency and implemented the CSR-based SpMV on 
multicore CPUs. Recently, nVidia’s Modern GPU library IlStl implemented an improved reduc¬ 
tion method, which has been used as a back-end of cuDPP. However, its performance still suf¬ 
fered by pre- and post-processing empty rows in global memory space. Our method, in contrast, 
uses scratchpad memory more efficiently and utilizes the two types of cores in a heterogeneous 
processor for better workload distribution. 

Compared with our recent work CSR5 lll2|] . a format designed for cross-platform SpMV on 
CPUs, GPUs and Xeon Phi, the SpMV approach presented in this paper does not need to process 
any format conversion or generate any auxiliary data for the input CSR matrix. Consider the 
format conversion from the CSR to the CSR5 merely needs the cost of a few SpMV operations, 
the CSR5-based SpMV and the CSR-based SpMV can find their own application scenarios, such 
as solvers with different number of iterations. 


6. Conclusion and Future Work 


We proposed an efficient method for SpMV on heterogeneous processors using the CSR 
storage format. On three mainstream platforms from Intel, AMD and nVidia, our method greatly 
outperforms row block method CSR-based SpMV algorithms running on GPUs. The perfor¬ 
mance gain mainly comes from our newly developed speculative segmented sum strategy that 
efficiently utilizes different types of cores in a heterogeneous processor. 

In this work, we assign different task to the CPU part and the GPU part in one heteroge¬ 
neous processor. However, the heaviest workload (stage 1 in our method) currently only runs on 
GPU cores, while the CPU cores may be idle. Obviously, it is possible to schedule tasks in the 
first stage on both CPU cores and GPU cores simultaneously for potentially higher throughput. 
However, a heterogeneity-aware scheduling strategy is beyond the scope of the SpMV algorithm 


focused in this paper. We refer the reader to | ^ ^ 3 5a] for recent progress on utilizing both 


CPU cores and GPU cores in a heterogeneous environment. 
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Appendix A. Pseudo Code 


Algorithm 1 The SPMD pseudo code of a thread-bunch in speculative execution stage of the 
CSR-based SpMV 

1 : function speculative_execution_gpu() 

2: tb <— get-thread-globalid( )/T ; 

3: flpositioning row indices of tiles in the thread-bunch 

4: for / = 0 to 5 do 

5: boundary [/] tb X S xWxT + i xW xT 

6: tile_of f set [/] <— BiNARY_SEARCH(*row_pointer, boundary [/]) 

7: end for 

8: //iterative steps in a thread-bunch 

9: for / = 0 to 5 - 1 do 

10: Start <— tile_off set [/] 

11: stop <—tile_of f set [/-I-1] 

12: MEMSET(*seg_descriptor, FALSE) 

13: dirty <— FALSE 

14: //calculating segment descriptor 

15: for j — start to stop - 1 do 

16: if row_pointer [j] row_pointer [j-I-1] then 

17: descriptor [row 4 >ointer [j] - boundary [/]] <— TRUE 

18 : else 

19: dirty <— TRUE 

20 : end if 

21 : end for 

22: //collecting element-wise products 

23: for y = 0 to W X r - 1 do 

24: X-value <— x [column_index [boundary [/] + y] ] 

25: product [y] <— X-value X value [boundary [/] + y] 

26: end for 

27: //transmitting a value from the previous tile 

28: if descriptor [0] = FALSE then 

29: product [0] <— product [0] ■+■ transmitter 

30: descriptor [0] <— TRUE 

31: end if 
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Usegmented sum 

SEGMENTED_REDUCTiON(*product, ^descriptor, *ts, *tc) 
//calculating index offset in y 
*y_index <— EXCLUsivE_scAN(*tc) 

//saving partial sums to y 
for j - OtoT - 1 do 

for A: = 0 to tc [j] - 1 do 

index <— start + y_index [j] + k 
//first segment of the thread-bunch 
if index - tile_off set [0] then 
synchronizer [fA']./c/x <— index 

synchronizer \_tb'\.val <— product [y X Vi^+ts [j] +A:] 
else 

//storing to y directly 
y [i'nc/ex] <— product [j X W+ts [j] +A:] 

end if 

if index - stop then 

transmitter <— product Ij X Vi^+ts [j] +A:] 

end if 
end for 
end for 

//labeling dirty tile 
if dirty — TRUE then 

pos <— ATOMic_iNCREMENT(*dirty_counter) 
speculator [y)Oi] {start, stop) 
transmitter <— 0 

end if 
end for 
end function 

function synchronization_cpu() 

for / = 0 to lnnz/(S xW x T)] - 1 do 
index <— synchronizer 
value <— synchronizer [/].vaA 
y [i'nc/ex] <— y [/nc/ex] + value 
end for 
end function 
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